Revisiting model assessment

In earlier lectures, we’ve discussed how we can simulate data from models, and also how we fit models to data. Today we’re going to exploit these ideas and introduce another method for model assessment.

Libraries for this lecture

library(tidyverse)
library(ggResidpanel)
library(DHARMa)
library(patchwork)

Review of Q-Q plots

Our models (so far) assume that the residuals are approximately normally distributed. We assessed this before by comparing the quantiles of the standardised residual distribution with the theoretical quantiles of a normal distribution. Let's remind ourselves of this by simulating residuals from a normal(0, sd = 3) distribution and comparing their quantiles with the theoretical quantiles of a normal distribution.

In fact, let’s do this with 20 simulated points:

set.seed(395712)
sim1 <- rnorm(20, mean = 0,sd = 3)
qqnorm(sim1)
qqline(sim1, col = "green")

The points in the bottom left that sit above the line are sample points closer to 0 than expected (the lower tail is shorter than the theoretical one); similarly, the points in the upper right that sit below the line are sample points closer to 0 than expected (the upper tail is shorter).

We know that the data come from a normal distribution, so this qq-plot gives you some indication of how normally distributed data can deviate from the theoretical distribution.
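For reference, qqnorm() essentially plots the sorted sample against standard-normal quantiles evaluated at evenly spaced probabilities (generated by ppoints()). Here is a base-R sketch of that computation:

```r
set.seed(395712)
x <- rnorm(20, mean = 0, sd = 3)

probs <- ppoints(length(x))   # evenly spaced probabilities in (0, 1)
theoretical <- qnorm(probs)   # theoretical standard-normal quantiles
observed <- sort(x)           # ordered sample quantiles

plot(theoretical, observed)   # same point pattern as qqnorm(x)
```

You can confirm the match with qqnorm(x, plot.it = FALSE), which returns these coordinates without drawing the plot.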

Let’s try again with a different seed. This simulated data distribution has a somewhat longer right tail than the theoretical normal distribution.

set.seed(24474)
sim2 <- rnorm(20, mean = 0,sd = 3)
qqnorm(sim2)
qqline(sim2, col = "green")

Review traditional model assessment plots using ggResidpanel

Let’s simulate data from a simple experiment: we measure leaf temperature in drought-resistant and wild-type tomato plants exposed to 7 days of drought.

In this first experiment, we measure 1 leaf per plant, so we can treat each leaf as an independent measurement.

Simulated leaf temperature data:

library(tidyverse)
set.seed(1234)
leaf_temp <- tibble(PlantID = 1:32,
                    Genotype = rep(c("WT","m"), each = 16),
                    Temp = rep(c(28.5,26), each = 16) + rnorm(32, 0,3))
ggplot(leaf_temp, aes(x=Genotype, y=Temp, col=Genotype))+
  geom_boxplot()+
  geom_point()

Fit a model to the leaf temperature data. What are the estimated model parameters (including sd estimate of the residual)?

model_leaf <- lm(Temp ~ Genotype, data = leaf_temp)
summary(model_leaf)
## 
## Call:
## lm(formula = Temp ~ Genotype, data = leaf_temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0678 -1.7583 -0.7326  1.8410  7.8281 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.4194     0.6908  36.795   <2e-16 ***
## GenotypeWT    2.1112     0.9770   2.161   0.0388 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.763 on 30 degrees of freedom
## Multiple R-squared:  0.1347, Adjusted R-squared:  0.1058 
## F-statistic:  4.67 on 1 and 30 DF,  p-value: 0.03881

We can check the normality of residuals assumption using the usual model assessment tool:

resid_panel(model_leaf)

The Package DHARMa (“Diagnostics for HierArchical Regression Models”)

The DHARMa package for model assessment was developed to test model assumptions for more complex models, for example when there is data structure or when the residuals are not normally distributed (generalised linear models).

Here are the main ideas:

  • From our data, fit a model using the function lm().
  • Using this model, simulate new data for each observed value. The default number of simulations around each observed value is 250.

If our model is reasonable, our observed data are a random draw from our simulated distributions.

We define the “scaled residual” to be the quantile of the observed residual relative to its simulated distribution. The scaled residuals will have a uniform(0,1) distribution when the observed residuals are a random draw from the simulated distribution.
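In base R, the computation for a single observation can be sketched as follows. This is a toy illustration of the idea, not DHARMa's exact implementation (DHARMa also randomises between tied values), and the numbers are made up:

```r
set.seed(1)
# 250 simulated responses around one (hypothetical) fitted value
simulated <- rnorm(250, mean = 27.5, sd = 2.8)
observed <- 29.1  # the (hypothetical) observed response for that data point

# Scaled residual: the empirical quantile of the observed value,
# i.e. the proportion of simulations at or below it
scaled_resid <- mean(simulated <= observed)
scaled_resid  # a value between 0 and 1
```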

Model assessment using DHARMa: Experiment with 2 groups

Let’s simulate data from the model_leaf model:

set.seed(210621)
sim_dat <- simulateResiduals(model_leaf)

The object sim_dat is a list.

One element of the list, sim_dat$simulatedResponse, is a 32 x 250 matrix in which each row is a set of simulations around a predicted value.

str(sim_dat$simulatedResponse)
##  num [1:32, 1:250] 26 26.9 31.8 27.7 27.9 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:32] "1" "2" "3" "4" ...
##   ..$ : chr [1:250] "sim_1" "sim_2" "sim_3" "sim_4" ...

Let’s look at a histogram for one row, or one set of simulations:

n <- 23
dat <- tibble(sim1 = sim_dat$simulatedResponse[n,])
ggplot(dat, aes(x = sim1)) +
  geom_histogram(fill = "red") +
  geom_vline(xintercept = leaf_temp$Temp[n])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If we transpose this data frame, then each column will be a simulated distribution around the predicted value of an observation. In the code below, we create a histogram from the simulated data based upon the first 9 observations, and show where each observed value sits within that histogram.

Notice that these histograms are approximately centred around 27.5, which is the observed mean for the WT genotype.

When the model is correct, the observed value will be a random draw from the simulated distribution.

t_sim_dat <- t(sim_dat$simulatedResponse) %>%
  as_tibble()

plt <- list() # Create an empty list of plots

for (i in 1:9) {
  t_sim_dat$x <- unlist(t_sim_dat[, i])
  plt[[i]] <- ggplot(t_sim_dat, aes(x = x)) +
    geom_histogram() +
    geom_vline(xintercept = leaf_temp$Temp[i], col = "red")
}

plt[[1]] + plt[[2]] + plt[[3]] + plt[[4]]+
  plt[[5]] + plt[[6]] + plt[[7]] + plt[[8]] + plt[[9]]

The scaled residuals are the quantiles of the observed values relative to the simulated data. For example, in the top-left histogram the scaled residual is approximately 0.2, whereas in the middle-left histogram it is approximately 0.01. The scaled residuals will have a uniform(0,1) distribution when the model is a reasonable representation of the data.

resid_scale <- residuals(sim_dat)
resid_scale
##  [1] 0.172 0.752 0.948 0.008 0.756 0.800 0.432 0.420 0.356 0.308 0.452 0.196
## [13] 0.300 0.664 0.920 0.560 0.384 0.176 0.232 1.000 0.664 0.384 0.388 0.732
## [25] 0.276 0.096 0.792 0.232 0.536 0.220 0.916 0.404
plot(sim_dat)

The plot on the left is a QQ-plot comparing the observed scaled residuals to the theoretical quantiles of the uniform(0,1) distribution.

The plot on the right shows two boxplots of the scaled residuals - one for each genotype. We would expect that the bold line in the middle of the box (denoting the observed median) would align with 0.50 on the y-axis, and the edges of the box (denoting the 25th and 75th percentiles) would align with 0.25 and 0.75 on the y-axis.
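To see why uniform(0,1) is the right benchmark, here is a self-contained base-R simulation (with toy parameters chosen for illustration): when the observed data really do come from the same model used for the simulations, the scaled residuals are uniformly distributed.

```r
set.seed(42)
n_obs <- 100
n_sim <- 250

# Observations drawn from the same distribution we will simulate from
obs <- rnorm(n_obs, mean = 27, sd = 3)

# For each observation, simulate its distribution and take the empirical quantile
scaled <- sapply(obs, function(o) mean(rnorm(n_sim, mean = 27, sd = 3) <= o))

hist(scaled) # roughly flat between 0 and 1
```

The histogram of the scaled residuals should look roughly flat, with a mean near 0.5.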

DHARMa and mis-specified models

It’s useful to look at residual plots when the model is mis-specified. Suppose we measure plant hormone levels in our leaf temperature experiment, and they are log-normally distributed:

set.seed(3)
leaf_temp$horm_pg.mg <- exp(rep(c(2,3.5),each = 16)+ rnorm(32,mean = 0, sd = 0.5)) 
ggplot(leaf_temp, aes(x = Genotype, y = horm_pg.mg, col = Genotype))+
  geom_boxplot()+
  geom_point()

model_horm <- lm(horm_pg.mg ~ Genotype, data = leaf_temp)
resid_panel(model_horm)

sim_horm <- simulateResiduals(model_horm)
plot(sim_horm)

Let’s review the precipitation/yield data we generated in an earlier lecture:

set.seed(13579)
annual_rain<-seq(11,100, 1)
yield <- 2 + 5*annual_rain - 0.04* annual_rain^2 + rnorm(90,0,10)
yield_dat<-tibble(annual_rain = annual_rain, yield=yield)
ggplot(yield_dat, aes(annual_rain, yield))+geom_point()

The correct model for the yield data includes a linear and quadratic term. We first fit the correct model and assess the model fit using DHARMa.

model_yield <- lm(yield~annual_rain+I(annual_rain^2), data = yield_dat)
summary(model_yield)
## 
## Call:
## lm(formula = yield ~ annual_rain + I(annual_rain^2), data = yield_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0516  -8.3180   0.3431   7.0240  26.8273 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.369346   4.795349  -0.494    0.622    
## annual_rain       5.175252   0.194563  26.599   <2e-16 ***
## I(annual_rain^2) -0.041463   0.001716 -24.168   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.823 on 87 degrees of freedom
## Multiple R-squared:  0.9009, Adjusted R-squared:  0.8986 
## F-statistic: 395.3 on 2 and 87 DF,  p-value: < 2.2e-16

The model coefficients are good estimates for the true coefficients.

sim_yield_dat <- simulateResiduals(model_yield)
plot(sim_yield_dat)

The plot on the left is a QQ-plot comparing the observed scaled residuals to the theoretical quantiles of the uniform(0,1) distribution.

The plot on the right shows the distribution of the scaled residuals relative to their predicted values (rank transformed). We would expect that the distributions are approximately uniform around each predicted value. In this case, the 3 bolded lines would be flat around the quartiles 0.25, 0.50 and 0.75.

We can compare this with what we would get when the model is mis-specified, for example, when we fail to include the quadratic structure in the model.

model_yield2 <- lm(yield~annual_rain, data = yield_dat)
summary(model_yield2)
## 
## Call:
## lm(formula = yield ~ annual_rain, data = yield_dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.504 -16.987   4.467  19.223  43.859 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  97.3633     6.7449  14.435  < 2e-16 ***
## annual_rain   0.5728     0.1101   5.204 1.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.13 on 88 degrees of freedom
## Multiple R-squared:  0.2354, Adjusted R-squared:  0.2267 
## F-statistic: 27.09 on 1 and 88 DF,  p-value: 1.26e-06

Notice that the annual_rain coefficient is much lower than the true value. The lower estimate is a compromise between the initial upward trajectory and the later downward trajectory of the data.

We can visualise this model for these data using geom_smooth:

ggplot(yield_dat, aes(annual_rain, yield))+
  geom_point()+
  geom_smooth(method = "lm")

Here are the DHARMa plots:

sim_yield_dat2 <- simulateResiduals(model_yield2)
plot(sim_yield_dat2)

The plot on the left does not pick up the model mis-specification, because overall the scaled residuals are still approximately uniformly distributed.

The plot on the right shows that the distribution of the scaled residuals depends strongly on the predicted value. One can clearly see that the model over-predicts yield for low and high annual rainfall, and under-predicts yield for moderate annual rainfall. The red bold lines deviate strongly from the expected dotted red horizontal lines at 0.25, 0.50 and 0.75.

We can compare these plots with the resid_panel() plots, which provide roughly the same information. The upper-left plot (residuals against predicted values) shows the same systematic curvature in the residuals.

resid_panel(model_yield2)

Model mis-specification in the variation around the mean

A key assumption in the models we have fit so far is that the variation around the mean response is independent of the mean response. With biological data, it is not uncommon for the variation around the mean to increase as the mean increases.

For example, a model of exponential growth is: size(t) = exp(bt + “noise”), where t = time. The coefficient “b” can be estimated by fitting a linear model to log(size(t)).
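As a quick check of that claim, we can simulate exponential growth with a made-up coefficient (b = 0.1 here) and recover it by fitting lm() on the log scale:

```r
set.seed(99)
t <- 1:50
b <- 0.1 # assumed true growth coefficient
size <- exp(b * t + rnorm(50, mean = 0, sd = 0.3)) # log-normal noise

fit <- lm(log(size) ~ t)
coef(fit)["t"] # estimate should be close to the true b = 0.1
```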

The normally distributed noise term is in the exponent; we say exp(“noise”) is log-normally distributed. In the code below, we simulate data following this idea. You can see in the plots below how the variation increases as the mean increases.

Plotting on the log scale makes the variation once again independent of the mean.

set.seed(1357)
annual_rain<-seq(11,100, 1)
yield <- exp(0.1*annual_rain - 0.0006*annual_rain^2)*rlnorm(90, sd = 0.5)
yield_dat_log<-tibble(annual_rain = annual_rain, yield=yield)
ggplot(yield_dat_log, aes(annual_rain, yield))+geom_point()

ggplot(yield_dat_log, aes(annual_rain, yield))+geom_point()+
  scale_y_log10()

In our first model, we assume there is no mean-variance relationship and we fit the usual linear model. What do you notice in the residual panel?

model_yield <- lm(yield~annual_rain+I(annual_rain^2), data = yield_dat_log)
summary(model_yield)
## 
## Call:
## lm(formula = yield ~ annual_rain + I(annual_rain^2), data = yield_dat_log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.610 -12.283  -5.394   6.963  72.930 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      -21.958401  10.862029  -2.022  0.04629 * 
## annual_rain        1.388567   0.440708   3.151  0.00223 **
## I(annual_rain^2)  -0.004575   0.003886  -1.177  0.24229   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.25 on 87 degrees of freedom
## Multiple R-squared:  0.526,  Adjusted R-squared:  0.5151 
## F-statistic: 48.27 on 2 and 87 DF,  p-value: 7.869e-15
resid_panel(model_yield)

Here are the DHARMa plots. Interpret these plots in light of what you know about the true model. Compare them with the resid_panel() plots.

sim_yield_dat <- simulateResiduals(model_yield)
plot(sim_yield_dat)

The code below fits the true model of the data. Have a look at the coefficient estimates. Do they match the coefficients of the true model?

Compare the plots from resid_panel with the DHARMa plots.

model_log_yield <- lm(log(yield)~annual_rain+I(annual_rain^2), data = yield_dat_log)
summary(model_log_yield)
## 
## Call:
## lm(formula = log(yield) ~ annual_rain + I(annual_rain^2), data = yield_dat_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22726 -0.25640 -0.06531  0.34331  0.90691 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.797e-02  2.354e-01   0.246    0.806    
## annual_rain       9.232e-02  9.553e-03   9.664 1.96e-15 ***
## I(annual_rain^2) -5.273e-04  8.424e-05  -6.260 1.41e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4823 on 87 degrees of freedom
## Multiple R-squared:  0.7949, Adjusted R-squared:  0.7902 
## F-statistic: 168.6 on 2 and 87 DF,  p-value: < 2.2e-16
resid_panel(model_log_yield)

plot(simulateResiduals(model_log_yield))

This was a brief introduction to the DHARMa package. We’ll continue to use it when we introduce generalised linear models and the more traditional model assessment tools are not useful.